Modern high-performance computing faces a fundamental "Memory Wall": the explosive growth in computational throughput (FLOPS) has far outpaced the modest increases in global memory bandwidth. This discrepancy turns massive many-core arrays into "starved" processors that spend much of their time waiting for data.
1. The Bandwidth Gap
While a GPU can perform trillions of operations per second, the physical path to DRAM is constrained by pin density and power requirements. Memory thus becomes the limiting factor to parallelism: as you scale thread counts, per-thread bandwidth drops, producing stall cycles in which the hardware sits idle.
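The gap can be made concrete with a roofline-style back-of-envelope calculation. The device numbers below are round, illustrative assumptions (30 TFLOP/s of compute, 1 TB/s of bandwidth), not the specs of any particular GPU:

```python
# Illustrative roofline-style arithmetic; peak_flops and peak_bw are
# assumed round figures, not specs for a specific device.
peak_flops = 30e12   # 30 TFLOP/s of compute throughput
peak_bw    = 1e12    # 1 TB/s of global-memory bandwidth

# Machine balance: FLOPs the chip can issue per byte it can fetch.
balance = peak_flops / peak_bw   # 30 FLOPs per byte

# A kernel doing 2 FLOPs per 4-byte float loaded (0.5 FLOP/byte) is
# bandwidth-bound: its attainable rate is capped by memory traffic,
# not by the ALUs.
intensity  = 2 / 4
attainable = min(peak_flops, intensity * peak_bw)

print(f"machine balance: {balance:.0f} FLOPs/byte")
print(f"attainable: {attainable / 1e12:.1f} of {peak_flops / 1e12:.0f} TFLOP/s")
```

Under these assumptions, a kernel would need roughly 30 FLOPs of work per byte fetched just to keep the ALUs busy; the example kernel achieves only about 1.7% of peak, which is the "starvation" described above.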
2. The Kitchen Analogy
Imagine a state-of-the-art kitchen (the GPU cores) capable of cooking 1,000 meals/hour. However, the ingredients are in a warehouse (global memory) five miles away, and there is only one delivery scooter (the memory bus). No matter how many chefs you hire, your output is capped by the scooter's speed.
3. Architectural Contrast
A standard multicore CPU hides latency for a few heavy threads behind massive caches. Massively parallel architectures, by contrast, face a constant "traffic jam" of concurrent requests and hide latency by switching among many resident threads instead. Resource limits at the register and shared-memory level therefore dictate the maximum degree of parallelism (occupancy) an SM can sustain: the more registers or shared memory each thread block consumes, the fewer blocks can be resident at once.
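The occupancy trade-off can be sketched numerically. The per-SM limits below are assumptions typical of recent NVIDIA parts (64K registers, 96 KB shared memory, 2048 resident threads), not the figures for any specific chip:

```python
# Sketch of an occupancy calculation; the per-SM limits are assumed,
# representative values, not those of a particular GPU.
REGS_PER_SM    = 65536       # register file entries per SM
SMEM_PER_SM    = 96 * 1024   # bytes of shared memory per SM
MAX_THREADS_SM = 2048        # cap on resident threads per SM

def occupancy(regs_per_thread, smem_per_block, threads_per_block):
    """Fraction of the SM's thread slots that can be filled."""
    # Each resource independently caps the number of resident blocks.
    by_regs    = REGS_PER_SM // (regs_per_thread * threads_per_block)
    by_smem    = SMEM_PER_SM // smem_per_block if smem_per_block else 10**9
    by_threads = MAX_THREADS_SM // threads_per_block
    blocks = min(by_regs, by_smem, by_threads)
    return blocks * threads_per_block / MAX_THREADS_SM

# Doubling register use per thread halves the resident thread count.
print(occupancy(32, 12 * 1024, 256))  # -> 1.0
print(occupancy(64, 12 * 1024, 256))  # -> 0.5
```

In this sketch, moving from 32 to 64 registers per thread drops occupancy from 100% to 50%: with fewer resident warps to switch between, the SM has less ability to hide memory latency, which is exactly how register pressure caps parallelism.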